About the project

This is the first assignment of the Introduction to Open Data Science 2018 course, covering Git / RStudio / R Markdown exercises.

GitHub repository for this exercise


Chapter 2: Week 2 regression analysis

library(data.table)

# Read in learning2014 created with create_learning2014.R
learning2014<-fread("learning2014.txt")

Dimensions and variable characteristics of learning2014

# Number of records and number of variables in the dataset
dim(learning2014)
## [1] 166   7
# Variable types (int = integer variables, num = continuous floating-point variables)
str(learning2014)
## Classes 'data.table' and 'data.frame':   166 obs. of  7 variables:
##  $ Gender  : int  2 1 2 1 1 2 1 2 1 2 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, ".internal.selfref")=<externalptr>

  • Gender: Male = 1, Female = 2
  • Age: Age (in years) derived from the date of birth
  • Attitude: Global attitude toward statistics
  • Deep: Deep approach
  • Stra: Strategic approach
  • Surf: Surface approach
  • Points: Total points (max of all)

Variable summary statistics

summary(learning2014)
##      Gender           Age           Attitude          Deep      
##  Min.   :1.000   Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  1st Qu.:1.000   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Median :2.000   Median :22.00   Median :32.00   Median :3.667  
##  Mean   :1.663   Mean   :25.51   Mean   :31.43   Mean   :3.680  
##  3rd Qu.:2.000   3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##  Max.   :2.000   Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Histograms of each variable to visualize distribution shape

par(mfrow=c(3,3))
sapply(1:ncol(learning2014),function(i) hist(learning2014[[i]], breaks=20, xlab=NA,main=names(learning2014)[i]))

Density plots

par(mfrow=c(3,3))
sapply(1:ncol(learning2014),function(i) plot(density(learning2014[[i]]), xlab=NA,main=names(learning2014)[i]))

Age has a long right tail. Attitude, Deep, Stra, and Surf could be considered approximately normally distributed. The Surf and Stra histograms clearly show that they are sum variables with multiple modes (peaks in the histogram).
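A quick numeric check can complement the visual inspection of distribution shape. A minimal base-R sketch of moment-based sample skewness, applied to a simulated right-tailed vector (the same function would apply to any column of learning2014):

```r
# Moment-based sample skewness: mean((x - mean)^3) / sd^3
# Positive values indicate a right tail, negative a left tail
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (sd(x)^3)
}

# A toy right-skewed vector (long right tail, similar in spirit to Age)
x <- c(rep(20, 50), rep(25, 30), rep(40, 10), rep(55, 5))
skewness(x)  # positive, confirming the right tail
```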

Pairwise relationships between variables

pairs(learning2014)

Visually, at least Attitude seems to have a relationship with Points. To investigate this further, let’s additionally draw a correlation plot.

library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(learning2014, method="spearman"), type="lower")

Attitude clearly correlates with Points. Furthermore, Age, Gender, Stra, and Surf all seem to have a small correlation with it. Given that we are allowed to choose only three variables for this exercise, Age and Gender together with Attitude seem the most interesting combination.

Linear regression model

Let’s first try a model with Age and Gender (basic demographic covariates) and Attitude as the third variable.

summary(lm(Points~Age+I(as.factor(Gender))*Attitude, data=learning2014))
## 
## Call:
## lm(formula = Points ~ Age + I(as.factor(Gender)) * Attitude, 
##     data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3457  -3.3677   0.1811   3.8529  10.2281 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    11.21591    4.18555   2.680 0.008135 ** 
## Age                            -0.07516    0.05379  -1.397 0.164246    
## I(as.factor(Gender))2           2.84438    4.45715   0.638 0.524274    
## Attitude                        0.41480    0.11115   3.732 0.000263 ***
## I(as.factor(Gender))2:Attitude -0.07583    0.13155  -0.576 0.565115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.326 on 161 degrees of freedom
## Multiple R-squared:  0.2034, Adjusted R-squared:  0.1836 
## F-statistic: 10.28 on 4 and 161 DF,  p-value: 1.944e-07

Apparently Age and Gender have no statistically significant effect in the model, so let’s keep only Attitude.

summary(lm(Points~Attitude, data=learning2014))
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

As Attitude is measured on an arbitrary scale, let’s standardize it (mean = 0, sd = 1) for easier interpretation.

summary(lm(Points~I(scale(Attitude)), data=learning2014))
## 
## Call:
## lm(formula = Points ~ I(scale(Attitude)), data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         22.7169     0.4129  55.019  < 2e-16 ***
## I(scale(Attitude))   2.5733     0.4141   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now we can say that Attitude is highly significant in this model (p = 4e-9). Its effect (beta) is 2.5733, which means that a one standard deviation increase in Attitude increases Points by approximately 2.57.

Multiple R-squared is 0.1906, which means that Attitude alone explains about 19% of the variability in Points.
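The effect of standardizing a predictor can be demonstrated on simulated data (a sketch, not the course data): the t-statistic and p-value are unchanged, and the coefficient is simply rescaled to a per-standard-deviation effect.

```r
set.seed(1)
x <- rnorm(100, mean = 30, sd = 7)       # simulated predictor, arbitrary scale
y <- 10 + 0.4 * x + rnorm(100, sd = 5)   # simulated outcome

raw    <- summary(lm(y ~ x))$coefficients
scaled <- summary(lm(y ~ scale(x)))$coefficients

# The slope's t-statistic is identical in both fits;
# the standardized coefficient equals the raw slope times sd(x)
raw["x", "t value"]
scaled["scale(x)", "t value"]
scaled["scale(x)", "Estimate"] / sd(x)   # recovers the raw slope
```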

Model diagnostics

plot(lm(Points~Attitude, data=learning2014))

The residuals do not show any non-linear patterns. The QQ plot indicates that the residuals are approximately normally distributed, which further supports the linear model assumptions. The scale-location plot shows no apparent heteroscedasticity, although cases 145, 56, and 36 could be investigated further as potentially interesting outliers. However, none of these cases (or any others) falls beyond Cook’s distance in the fourth plot; if they did, they should probably be excluded from the analysis as outliers that influence the model too much.

In summary, we can conclude that the assumptions of the linear model appear to be well met.
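The influence measure behind the fourth diagnostic plot can also be extracted directly with cooks.distance(). A base-R sketch on simulated data (the same calls apply to the Points ~ Attitude model):

```r
set.seed(2)
x <- rnorm(50)
y <- 2 + x + rnorm(50)
fit <- lm(y ~ x)

# One Cook's distance per observation
cd <- cooks.distance(fit)

# Flag observations exceeding the common 4/n rule of thumb
which(cd > 4 / length(cd))
```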


Chapter 3: Week 3 Logistic regression analysis

library(data.table)

# Read in the joined alcohol consumption dataset
alc<-fread("data/alc.txt")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This “Student Performance Data Set” is from the UCI Machine Learning Repository. It is described as:

“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”

We joined these two datasets (mat and por); overlapping variables were either averaged (quantitative) or taken from the math dataset (qualitative). The variables alc_use and high_use were calculated from the data.
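The joining logic described above can be sketched in base R on two hypothetical tables (the column names here are illustrative, not the actual alc variables):

```r
# Two toy "mat" and "por" tables sharing an id column plus overlapping measures
mat <- data.frame(id = 1:3, absences = c(2, 4, 6),
                  guardian = c("mother", "father", "mother"))
por <- data.frame(id = 1:3, absences = c(4, 4, 10),
                  guardian = c("father", "father", "mother"))

# Join on id; suffixes disambiguate the overlapping columns
joined <- merge(mat, por, by = "id", suffixes = c(".mat", ".por"))

# Average overlapping quantitative columns, keep the math value for qualitative ones
joined$absences <- rowMeans(joined[, c("absences.mat", "absences.por")])
joined$guardian <- joined$guardian.mat

joined[, c("id", "absences", "guardian")]
```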

Analysis

Let’s select four variables for the analysis:

vars<-c("age","sex","Medu","goout")
str(alc[,vars,with=F])
## Classes 'data.table' and 'data.frame':   382 obs. of  4 variables:
##  $ age  : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ sex  : chr  "F" "F" "F" "F" ...
##  $ Medu : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ goout: int  4 3 2 2 2 2 4 4 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Age is a continuous variable, gender is dichotomous, mother’s education is multinomial, and going out is ordinal (Likert scale).

library(ggplot2)
library(GGally)
#ggpairs(alc, columns=c(vars,"high_use"))

1. Age

library(ggridges)
library(magrittr)
library(dplyr)
ggplot(data=alc, aes(x=high_use, y=age)) + geom_boxplot()

# Differences in mean and standard deviation
alc %>% group_by(high_use) %>% summarise(mean_age=mean(age), sd_age=sd(age))
## # A tibble: 2 x 3
##   high_use mean_age sd_age
##   <lgl>       <dbl>  <dbl>
## 1 FALSE        16.5   1.15
## 2 TRUE         16.8   1.21
# Univariate non-parametric test
wilcox.test(age~high_use, data=alc)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age by high_use
## W = 13266, p-value = 0.05185
## alternative hypothesis: true location shift is not equal to 0

The distribution seems somewhat skewed, hence the Wilcoxon rank-sum (Mann-Whitney) test. It indicates an association, although not a very strong one: older students are likely to have higher consumption, which makes intuitive sense.

2. Gender

ggplot(data=alc, aes(x=sex, fill=high_use)) + geom_bar()

table(alc$high_use, alc$sex)
##        
##           F   M
##   FALSE 157 113
##   TRUE   41  71
chisq.test(x=alc$high_use, y=alc$sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  alc$high_use and alc$sex
## X-squared = 13.863, df = 1, p-value = 0.0001967

Gender seems to have a strong univariate association, as anticipated.

3. Mother’s education

ggplot(data=alc, aes(x=Medu, fill=high_use)) + geom_bar()

table(alc$high_use, alc$Medu)
##        
##          0  1  2  3  4
##   FALSE  1 33 80 59 97
##   TRUE   2 18 18 36 38
chisq.test(y=alc$high_use, x=as.factor(alc$Medu))
## Warning in chisq.test(y = alc$high_use, x = as.factor(alc$Medu)): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  as.factor(alc$Medu) and alc$high_use
## X-squared = 12.031, df = 4, p-value = 0.01713

Mother’s education seems to have a slight association, as we expected a priori.

4. Going out with friends (1-5)

ggplot(data=alc, aes(x=high_use, y=goout)) + geom_boxplot()

# Differences in mean and standard deviation
alc %>% group_by(high_use) %>% summarise(mean_age=mean(goout), sd_age=sd(goout))
## # A tibble: 2 x 3
##   high_use mean_age sd_age
##   <lgl>       <dbl>  <dbl>
## 1 FALSE        2.86   1.04
## 2 TRUE         3.73   1.10
# Univariate non-parametric test
wilcox.test(goout~high_use, data=alc)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  goout by high_use
## W = 8577, p-value = 5.88e-12
## alternative hypothesis: true location shift is not equal to 0

“Going out” indicates a strong univariate association, which makes intuitive sense.

Logistic regression

# Let's use Harrell's regression modeling strategies (rms) package
library(rms)
# Make sure Medu is a factor and rescale the Likert-scale "going out" (center at zero: very high = 1, very low = -1)
alc[,Medu := as.factor(Medu)]
alc[,goout := (goout - 3)/2]

lrm(high_use~age+sex+Medu+goout, data=alc)
## Logistic Regression Model
##  
##  lrm(formula = high_use ~ age + sex + Medu + goout, data = alc)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           382    LR chi2      78.20    R2       0.264    C       0.760    
##   FALSE        270    d.f.             7    g        1.301    Dxy     0.519    
##   TRUE         112    Pr(> chi2) <0.0001    gr       3.672    gamma   0.523    
##  max |deriv| 6e-12                          gp       0.226    tau-a   0.216    
##                                             Brier    0.163                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept -0.2993 2.2897 -0.13  0.8960  
##  age        0.0703 0.1094  0.64  0.5204  
##  sex=M      0.9357 0.2551  3.67  0.0002  
##  Medu=1    -2.0598 1.2909 -1.60  0.1106  
##  Medu=2    -3.1142 1.2888 -2.42  0.0157  
##  Medu=3    -2.0764 1.2741 -1.63  0.1032  
##  Medu=4    -2.6082 1.2745 -2.05  0.0407  
##  goout      1.5668 0.2468  6.35  <0.0001 
## 
confint.default(lrm(high_use~age+sex+Medu+goout, data=alc))
##                2.5 %     97.5 %
## Intercept -4.7870567  4.1883960
## age       -0.1441360  0.2847951
## sex=M      0.4356889  1.4357720
## Medu=1    -4.5898916  0.4702347
## Medu=2    -5.6401508 -0.5881813
## Medu=3    -4.5736251  0.4209207
## Medu=4    -5.1061906 -0.1101853
## goout      1.0831240  2.0504201
ggcoef(glm(high_use~age+sex+Medu+goout, data=alc, family="binomial"), exponentiate = F, errorbar_height=.2, color="blue", sort="ascending")

The results seem quite interesting: the coefficients (log-odds), Wald p-values, and 95% confidence intervals are shown above.
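Since lrm and glm report log-odds coefficients, odds ratios and their confidence intervals are obtained by exponentiating. A sketch on simulated data (not the alc dataset):

```r
set.seed(3)
n <- 500
sex <- rbinom(n, 1, 0.5)       # simulated binary covariate
goout <- rnorm(n)              # simulated continuous covariate
p <- plogis(-1 + 0.9 * sex + 1.5 * goout)
y <- rbinom(n, 1, p)

fit <- glm(y ~ sex + goout, family = binomial)

exp(coef(fit))             # odds ratios
exp(confint.default(fit))  # Wald 95% CIs on the odds-ratio scale
```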

Predictive performance

# Remove age from the model (its p-value was high)
mod2<-lrm(high_use~sex+Medu+goout, data=alc, x=T, y=T)

# Get predictions and threshold at zero (the log-odds scale)
preds<-as.numeric(predict(mod2)>0)

# Confusion matrix
cmat<-table(preds, as.numeric(alc$high_use))

cmat
##      
## preds   0   1
##     0 254  68
##     1  16  44
# Training error rate (proportion of incorrectly classified individuals)
((cmat[1,2]+cmat[2,1])/sum(cmat))
## [1] 0.2198953

The error rate (22%) is better than random guessing (50%). However, this is likely rather optimistic, as it is an in-sample error rate.
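Beyond the raw error rate, sensitivity and specificity can be read off the same confusion matrix. A small base-R helper (a sketch, assuming rows are predictions and columns the observed classes, as in the table above):

```r
# cmat: rows = predicted (0/1), columns = observed (0/1)
confusion_stats <- function(cmat) {
  tn <- cmat[1, 1]; fn <- cmat[1, 2]
  fp <- cmat[2, 1]; tp <- cmat[2, 2]
  c(error       = (fp + fn) / sum(cmat),
    sensitivity = tp / (tp + fn),   # true positive rate
    specificity = tn / (tn + fp))   # true negative rate
}

# The confusion matrix from the model above
cmat <- matrix(c(254, 16, 68, 44), nrow = 2,
               dimnames = list(pred = 0:1, obs = 0:1))
round(confusion_stats(cmat), 3)
```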

Cross-validating

We can do cross-validation directly with the rms package.

validate(mod2, method="crossvalidation", B=10)
##           index.orig training    test optimism index.corrected  n
## Dxy           0.5188   0.5188  0.4588   0.0600          0.4588 10
## R2            0.2625   0.2643  0.2410   0.0233          0.2392 10
## Intercept     0.0000   0.0000 -0.0677   0.0677         -0.0677 10
## Slope         1.0000   1.0000  0.9827   0.0173          0.9827 10
## Emax          0.0000   0.0000  0.0181   0.0181          0.0181 10
## D             0.2010   0.2025  0.1657   0.0368          0.1642 10
## U            -0.0052  -0.0058  0.0538  -0.0596          0.0544 10
## Q             0.2062   0.2083  0.1118   0.0964          0.1098 10
## B             0.1634   0.1630  0.1715  -0.0085          0.1718 10
## g             1.2971   1.3068  1.2473   0.0595          1.2377 10
## gp            0.2252   0.2253  0.1886   0.0367          0.1885 10

This shows the optimism in the original model without cross-validation (e.g. in Somers’ D).
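The cross-validated error rate can also be computed by hand with a plain K-fold loop. A base-R sketch on simulated data (the same pattern applies to the alc model):

```r
set.seed(4)
n <- 300
x1 <- rnorm(n); x2 <- rnorm(n)
y <- rbinom(n, 1, plogis(x1 + x2))
d <- data.frame(y, x1, x2)

K <- 10
# Assign each observation to one of K folds at random
fold <- sample(rep(1:K, length.out = n))

errs <- sapply(1:K, function(k) {
  # Fit on everything except fold k, evaluate on fold k
  fit <- glm(y ~ x1 + x2, family = binomial, data = d[fold != k, ])
  pred <- predict(fit, newdata = d[fold == k, ], type = "response") > 0.5
  mean(pred != (d$y[fold == k] == 1))
})

mean(errs)  # cross-validated (out-of-sample) error estimate
```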

We can also include all variables in the model and perform backwards stepwise elimination with the rms package:

# Add all variables to the model (excluding those that were used to create the outcome variable)
mod3<-lrm(high_use~., data=alc[,-c("alc_use","Dalc","Walc"),with=F],x=T, y=T)

validate(mod3, method="crossvalidation", B=10, bw=T, pr=F)
## 
##      Backwards Step-down - Original Model
## 
##  Deleted    Chi-Sq d.f. P      Residual d.f. P      AIC   
##  Mjob        4.57  4    0.3339  4.57     4   0.3339  -3.43
##  Fjob        4.59  4    0.3316  9.17     8   0.3284  -6.83
##  internet    0.00  1    0.9707  9.17     9   0.4219  -8.83
##  schoolsup   0.01  1    0.9086  9.18    10   0.5150 -10.82
##  G3          0.03  1    0.8710  9.21    11   0.6027 -12.79
##  higher      0.05  1    0.8278  9.26    12   0.6810 -14.74
##  Pstatus     0.05  1    0.8156  9.31    13   0.7492 -16.69
##  age         0.08  1    0.7816  9.39    14   0.8055 -18.61
##  failures    0.18  1    0.6746  9.56    15   0.8463 -20.44
##  Fedu        0.21  1    0.6490  9.77    16   0.8784 -22.23
##  famsup      0.37  1    0.5408 10.14    17   0.8975 -23.86
##  reason      4.56  3    0.2071 14.70    20   0.7931 -25.30
##  school      0.31  1    0.5795 15.01    21   0.8224 -26.99
##  romantic    0.46  1    0.4994 15.47    22   0.8415 -28.53
##  nursery     0.50  1    0.4781 15.97    23   0.8566 -30.03
##  freetime    0.74  1    0.3884 16.71    24   0.8606 -31.29
##  famsize     0.99  1    0.3186 17.71    25   0.8545 -32.29
##  guardian    2.99  2    0.2247 20.69    27   0.8004 -33.31
##  health      1.66  1    0.1975 22.35    28   0.7646 -33.65
##  traveltime  2.10  1    0.1469 24.46    29   0.7060 -33.54
##  Medu        8.39  4    0.0783 32.85    33   0.4746 -33.15
##  paid        1.82  1    0.1768 34.67    34   0.4357 -33.33
##  studytime   2.71  1    0.0995 37.39    35   0.3600 -32.61
##  G2          3.14  1    0.0762 40.53    36   0.2772 -31.47
##  G1          0.68  1    0.4103 41.21    37   0.2916 -32.79
##  address     3.76  1    0.0524 44.97    38   0.2030 -31.03
##  activities  3.88  1    0.0490 48.85    39   0.1341 -29.15
##  famrel      5.79  1    0.0161 54.64    40   0.0613 -25.36
##  sex         6.38  1    0.0115 61.02    41   0.0228 -20.98
##  absences    6.59  1    0.0102 67.61    42   0.0074 -16.39
##  goout      15.15  1    0.0001 82.76    43   0.0003  -3.24
## 
## Approximate Estimates after Deleting Factors
## 
##         Coef   S.E. Wald Z        P
## [1,] -0.6665 0.1432 -4.653 3.27e-06
## 
## Factors in Final Model
## 
## None
##           index.orig training    test optimism index.corrected  n
## Dxy           0.0000   0.0000  0.0000   0.0000          0.0000 10
## R2            0.0000   0.0000  0.0000   0.0000          0.0000 10
## Intercept     0.0000   0.0000 -0.9021   0.9021         -0.9021 10
## Slope         1.0000   1.0000  1.0000   0.0000          1.0000 10
## Emax          0.0000   0.0000  0.2218   0.2218          0.2218 10
## D            -0.0026  -0.0029 -0.0263   0.0234         -0.0260 10
## U            -0.0052  -0.0058 -0.0317   0.0258         -0.0311 10
## Q             0.0026   0.0029  0.0053  -0.0024          0.0050 10
## B             0.2072   0.2072  0.2076  -0.0004          0.2076 10
## g             0.0000   0.0000  0.0000   0.0000          0.0000 10
## gp            0.0000   0.0000  0.0000   0.0000          0.0000 10
## 
## Factors Retained in Backwards Elimination
## 
##  school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason nursery
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  internet guardian traveltime studytime failures schoolsup famsup paid
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  activities higher romantic famrel freetime goout health absences G1 G2 G3
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
## 
## Frequencies of Numbers of Factors Retained
## 
##  0 
## 10

Chapter 4: Week 4 Housing values in suburbs of Boston

set.seed(12345)
library(MASS)
library(dplyr)
data(Boston)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

Variable details:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.
library(GGally)
library(corrplot)
corrplot(cor(Boston, method="spearman"), method = "circle")

newBost<-Boston
newBost$chas<-as.factor(newBost$chas)
newBost$rad<-as.factor(newBost$rad)
ggpairs(newBost)

Many variables seem to have non-normal distributions; many even look heavy-tailed, and a few are bimodal (having two modes, i.e. peaks, such as indus, tax, and rad). Strong positive and negative correlations are also present among several variables.

Let’s plot the density of each variable separately:

par(mfrow=c(4,4), oma = c(5,4,0,0) + 0.1,
          mar = c(0,0,1,1) + 0.1)
tmp<-lapply(1:ncol(Boston), function(i) plot(density(as.numeric(Boston[,i])), main=colnames(Boston)[i]))

Next, we normalize (scale) these variables to have zero mean and a standard deviation of one. This can be somewhat questionable for e.g. heavy-tailed distributions, where the variance and standard deviation may not be well defined, but it is requested by the assignment instructions.

Let’s scale and plot the new variables (the mean has been moved to zero and the standard deviation scaled to one):

# Scale original variables
newBost<-Boston
newBost<-(apply(newBost, 2, function(col) {
  if (is.numeric(col)) scale(col)
}))

# Plot scaled variables
par(mfrow=c(4,4), oma = c(5,4,0,0) + 0.1,
          mar = c(0,0,1,1) + 0.1)
tmp<-lapply(1:ncol(newBost), function(i) plot(density(as.numeric(newBost[,i])), main=colnames(newBost)[i]))

newBost<-as.data.frame(newBost)
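The standardization can be verified numerically: after scale(), every column should have mean 0 and sd 1. A sketch on a toy matrix:

```r
# A small toy matrix with columns on different scales
m <- cbind(a = c(1, 2, 3, 10), b = c(5, 5, 6, 8))
s <- scale(m)

round(colMeans(s), 10)   # effectively zero for every column
apply(s, 2, sd)          # exactly one for every column
```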

Create a factor from the quantiles of crim (replacing the original variable). Let’s also do an 80:20 split of the dataset by creating a holdout indicator.

newBost$crim<-with(newBost, cut(crim, quantile(crim, c(0, 0.25, 0.5, 0.75, 1)), include.lowest=TRUE, labels = c("Q1", "Q2", "Q3", "Q4")))
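One pitfall worth noting: by default cut() treats the lowest break as exclusive, so the minimum value falls out as NA when the breaks come from quantile(); include.lowest = TRUE keeps it. A toy sketch:

```r
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
q <- quantile(x, c(0, 0.25, 0.5, 0.75, 1))

# Default: the minimum (1) is outside the first interval and becomes NA
table(cut(x, q), useNA = "ifany")

# include.lowest = TRUE keeps the minimum in the first quantile bin
bins <- cut(x, q, include.lowest = TRUE, labels = c("Q1", "Q2", "Q3", "Q4"))
table(bins)
```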

holdout<-rbinom(n = nrow(newBost), size = 1, prob = 0.2)
print(table(holdout))
## holdout
##   0   1 
## 393 113
trainset<-newBost[holdout==0,]
testset<-newBost[holdout==1,]

LDA

Estimate the model on trainset, predict on testset (removing the original variable first), and compare the predictions to the original crim in testset. Cross-tabulate the predictions with the original values.

library(MASS)
mod1<-lda(crim~., data=trainset)

test_crim<-testset$crim
testset$crim<-predict(mod1, newdata=testset)$class

table(test_crim, testset$crim)
##          
## test_crim Q1 Q2 Q3 Q4
##        Q1 13 17  3  0
##        Q2  3 13 10  0
##        Q3  0  5 25  0
##        Q4  0  0  0 24
cat(paste("Test error rate:", round(mean(test_crim != testset$crim)*100, 1), "%"))
## Test error rate: 33.6 %
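The same fit-predict-crosstabulate pattern works on any labelled data. A minimal sketch using MASS::lda on the built-in iris dataset (in-sample predictions here, for brevity):

```r
library(MASS)

# Fit LDA with all four measurements as predictors
fit <- lda(Species ~ ., data = iris)
pred <- predict(fit)$class   # in-sample class predictions

# Cross-tabulate truth vs prediction and compute the training error rate
table(iris$Species, pred)
mean(pred != iris$Species)
```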

K-means clustering

Reload Boston and rescale. Use the total within-cluster sum of squares to find a suitable number of clusters:

# Reload Boston and rescale
newBost<-Boston
newBost<-(apply(newBost, 2, function(col) {
  if (is.numeric(col)) scale(col)
}))

wss <- sapply(1:20, 
              function(k){kmeans(newBost, k, nstart=50, iter.max = 15)$tot.withinss})
wss
##  [1] 7070.000 4576.681 3871.374 3445.946 3074.612 2723.217 2436.911
##  [8] 2248.218 2064.030 1940.011 1849.925 1729.998 1659.495 1598.493
## [15] 1544.558 1476.380 1426.614 1385.056 1333.214 1286.531
plot(1:20, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

This approach doesn’t seem to give any insight. Let’s try the gap statistic:

library(cluster)
cg<-clusGap(newBost, kmeans, 10, B = 100)
plot(cg)

This approach suggests nine clusters (although increasing the number to significantly higher levels could be argued for). Let’s build a k-means clustering model with nine clusters and plot it:

km1<-kmeans(newBost, 9)
ggpairs(as.data.frame(newBost), aes(col=as.factor(km1$cluster)))


Chapter 5: Week 5 exercise

Read in the data and visualize with ggpairs and spearman correlations.

library(data.table)
library(GGally)
library(corrplot)
library(dplyr)
human<-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", row.names=1, header=T, sep=",")
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human)

corrplot(cor(human, method="spearman"))

All variables seem to be unimodal, and most have skewed, non-normal distributions, some with heavy tails. Potentially only Edu.Exp could be approximated with a normal distribution. Correlations among Edu2.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, and Ado.Birth seem high (both positive and negative). Labo.FM and Parli.F do not seem to correlate strongly with the other variables.

Next we were asked to perform a PCA and draw a biplot, first with raw and then with scaled data. (Scaling non-normal distributions using standard deviations and means can be considered somewhat questionable, but let’s follow the assignment.)

pc<-princomp(human)
biplot(pc, cex=c(0.3, 1), main="Non-scaled PCA")

pc<-princomp(scale(human))
biplot(pc, cex=c(0.3, 1), main="Scaled PCA", xlab="PC1, female opportunity dimension (individual)", ylab="PC2, general female participation dimension (society level)")

barplot(pc$sdev)

summary(pc)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     2.0641471 1.1360379 0.87222124 0.77634649
## Proportion of Variance 0.5360463 0.1623703 0.09571374 0.07582845
## Cumulative Proportion  0.5360463 0.6984166 0.79413031 0.86995876
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     0.65981755 0.53457325 0.45751637 0.32119948
## Proportion of Variance 0.05477328 0.03595302 0.02633506 0.01297988
## Cumulative Proportion  0.92473204 0.96068506 0.98702012 1.00000000

The non-scaled PCA seems to be driven almost entirely by GNI, because it has a large absolute variance compared to the other variables. Thus, some kind of scaling makes sense, to give all variables the same initial weight.

In the scaled PCA, the x-dimension seems to gather variables related to the individual-level freedom / opportunities of women (explaining 53% of the variance). The y-axis seems to map variables that indicate the general societal participation level (i.e. politics) of women (explaining 16% of the variance). Together, these first two PCs explain nearly 70% of all variance.
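The proportions of variance quoted above come directly from the component standard deviations: each PC's share is its sdev squared over the total. A sketch with prcomp on the built-in USArrests data:

```r
# PCA on standardized data (scale. = TRUE divides by each column's sd)
pc <- prcomp(USArrests, scale. = TRUE)

# Proportion of variance explained by each component
prop <- pc$sdev^2 / sum(pc$sdev^2)
round(prop, 3)
round(cumsum(prop), 3)   # cumulative proportion
```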

MCA and Tea data

library(FactoMineR)
data(tea)
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always           <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home             <fct> home, home, home, home, home, home, home, hom...
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How              <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where            <fct> chain store, chain store, chain store, chain ...
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC              <fct> middle, middle, other worker, student, employ...
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....

All variables apart from the age variable are categorical (multinomial), and most of them are unordered. Their relationships are therefore better assessed not with correlations but with e.g. Cramér's V:

library(vcd)       # assocstats() for Cramér's V
library(corrplot)  # corrplot() for plotting the association matrix

# Pairwise Cramér's V for every pair of variables
tmp <- lapply(1:ncol(tea), function(i)
  sapply(1:ncol(tea), function(j) assocstats(table(tea[, i], tea[, j]))$cramer))
kor <- as.matrix(as.data.frame(tmp))
rownames(kor) <- colnames(tea)
colnames(kor) <- colnames(tea)
corrplot(kor, order = "hclust", method = "square")
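As a sanity check on what assocstats() returns: Cramér's V for an r × c table is sqrt(χ² / (n · (min(r, c) − 1))). A self-contained check on a small toy table (the cyl × am cross-tab of mtcars, chosen purely for illustration):

```r
library(vcd)  # assocstats()

tab <- table(mtcars$cyl, mtcars$am)  # toy 3 x 2 table

# Pearson chi-square without continuity correction, then the manual formula
chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
v_manual <- sqrt(unname(chi2) / (sum(tab) * (min(dim(tab)) - 1)))

all.equal(v_manual, assocstats(tab)$cramer)
```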

We can see that age appears to be associated with all other variables, though this may partly reflect the fact that age has many distinct values, which tends to inflate chi-square-based measures such as Cramér's V. In addition, where, price, and how form a correlated cluster, as do breakfast and frequency.

Let’s conduct multiple correspondence analysis. Data description says: “The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”

Furthermore, "The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables."

Thus, let's set the last variables (columns 19–36) as supplementary ones, with age (column 19) as a supplementary quantitative variable.

library(factoextra)

mod1<-MCA(tea,quanti.sup=19,quali.sup=20:36, graph=F)
fviz_screeplot(mod1, addlabels = TRUE, ylim = c(0, 45))

fviz_mca_biplot(mod1,
                repel = FALSE,  # set TRUE to avoid overlapping labels (slow with many points)
                ggtheme = theme_minimal())

fviz_mca_var(mod1, choice = "mca.cor",
             repel = TRUE,  # avoid text overlapping (slow)
             ggtheme = theme_minimal())

# Contributions of variable categories to dimension 1
fviz_contrib(mod1, choice = "var", axes = 1, top = 15)

# Contributions of variable categories to dimension 2
fviz_contrib(mod1, choice = "var", axes = 2, top = 15)

# Total contributions to dimensions 1 and 2
fviz_contrib(mod1, choice = "var", axes = 1:2, top = 15)

Dim 1: the 10 most correlated variables

head(dimdesc(mod1, axes = c(1,2))[[1]]$quali, 10)
##                 R2      p.value
## where    0.4179301 1.255462e-35
## tearoom  0.3718911 6.082138e-32
## how      0.2988286 1.273180e-23
## friends  0.2431995 8.616289e-20
## resto    0.2264676 2.319804e-18
## tea.time 0.1920380 1.652462e-15
## price    0.2160938 4.050469e-14
## pub      0.1472236 5.846592e-12
## work     0.1115359 3.000872e-09
## How      0.1028519 4.796010e-07

Dim 2: the 10 most correlated variables

head(dimdesc(mod1, axes = c(1,2))[[2]]$quali, 10)
##                R2      p.value
## where  0.62550194 4.542155e-64
## price  0.56056797 1.837909e-50
## how    0.51288621 4.103156e-47
## Tea    0.16034278 5.359827e-12
## resto  0.05883014 2.165287e-05
## age_Q  0.07663110 9.613084e-05
## dinner 0.04764166 1.385133e-04
## work   0.04334283 2.825934e-04
## sugar  0.03078909 2.286813e-03
## How    0.04300447 4.565763e-03